A comprehensive guide to understanding the differences between these three essential Pandas functions with detailed examples and explanations.
The following tables will be used in all examples in this guide. They contain realistic and useful data to
demonstrate the differences between concat(), merge(), and join().
| Employee ID | Name | Department |
|---|---|---|
| 101 | Fatimah Al-Zahrani | Finance |
| 102 | Ahmed Al-Zahrani | Human Resources |
| 103 | Mohammed Al-Zahrani | IT |
| 104 | Sheikha Al-Zahrani | Marketing |
| Employee ID | Salary | Bonus |
|---|---|---|
| 101 | 5000 | 500 |
| 102 | 6000 | 400 |
| 103 | 7000 | 600 |
| 105 | 8000 | 700 |
The following examples will use these two tables to clarify the differences between concat(),
merge(), and join(). Understanding these tables is essential to follow the examples.
concat() Function
Purpose: The concat() function is used to concatenate two or more DataFrames along
a specific axis, either rows or columns. It is commonly used for combining datasets when you want to stack or
add new rows/columns.
Key Features:
axis=0) or columns (axis=1).ignore_index parameter.Limitations:
Example Code:
import pandas as pd
# DataFrames based on the example tables
df_employees = pd.DataFrame({
'Employee ID': [101, 102, 103, 104],
'Name': ['Fatimah Al-Zahrani', 'Ahmed Al-Zahrani', 'Mohammed Al-Zahrani', 'Sheikha Al-Zahrani'],
'Department': ['Finance', 'Human Resources', 'IT', 'Marketing']
})
df_salaries = pd.DataFrame({
'Employee ID': [101, 102, 103, 105],
'Salary': [5000, 6000, 7000, 8000],
'Bonus': [500, 400, 600, 700]
})
# Concatenate along rows (axis=0)
result = pd.concat([df_employees, df_salaries], ignore_index=True)
print(result)
Output :
Employee ID Name Department Salary Bonus
0 101 Fatimah Al-Zahrani Finance NaN NaN
1 102 Ahmed Al-Zahrani Human Resources NaN NaN
2 103 Mohammed Al-Zahrani IT NaN NaN
3 104 Sheikha Al-Zahrani Marketing NaN NaN
4 101 NaN NaN 5000 500
5 102 NaN NaN 6000 400
6 103 NaN NaN 7000 600
7 105 NaN NaN 8000 700
Explanation: In this example, the concat() function is used to combine the two
tables, Employees and Salaries. Since they have different columns, NaN
values are inserted where data is missing. This demonstrates how concatenation works by stacking the rows of
one DataFrame below the other.
merge() Function
Key Features:
inner, outer, left, and right.Limitations:
concat() in some cases.Example Code:
import pandas as pd
# DataFrames based on the example tables
df_employees = pd.DataFrame({
'Employee ID': [101, 102, 103, 104],
'Name': ['Fatimah Al-Zahrani', 'Ahmed Al-Zahrani', 'Mohammed Al-Zahrani', 'Sheikha Al-Zahrani'],
'Department': ['Finance', 'Human Resources', 'IT', 'Marketing']
})
df_salaries = pd.DataFrame({
'Employee ID': [101, 102, 103, 105],
'Salary': [5000, 6000, 7000, 8000],
'Bonus': [500, 400, 600, 700]
})
# Merge the DataFrames on 'Employee ID' column (inner join by default)
result_inner = pd.merge(df_employees, df_salaries, on='Employee ID')
# Merge the DataFrames with a left join
result_left = pd.merge(df_employees, df_salaries, on='Employee ID', how='left')
print("Inner Join Result:")
print(result_inner)
print("\nLeft Join Result:")
print(result_left)
Output :(Inner Join)
Employee ID Name Department Salary Bonus
0 101 Fatimah Al-Zahrani Finance 5000 500
1 102 Ahmed Al-Zahrani Human Resources 6000 400
2 103 Mohammed Al-Zahrani IT 7000 600
Output :(Left Join)
Employee ID Name Department Salary Bonus
0 101 Fatimah Al-Zahrani Finance 5000 500.0
1 102 Ahmed Al-Zahrani Human Resources 6000 400.0
2 103 Mohammed Al-Zahrani IT 7000 600.0
3 104 Sheikha Al-Zahrani Marketing NaN NaN
Explanation:
Employee ID values in both
DataFrames are included. Thus, the row for Employee ID 105 from df_salaries and
Employee ID 104 from df_employees are excluded.
df_employees) are
retained, even if there's no match in the right DataFrame (df_salaries). Missing values in the
right DataFrame are filled with NaN.
join() Function
Purpose: The join() function is used to combine two DataFrames based on their
indices or a specified key column. It is particularly useful for index-based merges and simplifies the process
when the indices of both DataFrames are aligned or need alignment.
Key Features:
left, right, outer, and
inner.Limitations:
Example Code:
import pandas as pd
# Create two DataFrames: one with employee details and another with salary information
df_employees = pd.DataFrame({
'Employee ID': [101, 102, 103, 104],
'Name': ['Fatimah Al-Zahrani', 'Ahmed Al-Zahrani', 'Mohammed Al-Zahrani', 'Sheikha Al-Zahrani'],
'Department': ['Finance', 'Human Resources', 'IT', 'Marketing']
})
df_salaries = pd.DataFrame({
'Employee ID': [101, 102, 103, 105],
'Salary': [5000, 6000, 7000, 8000],
'Bonus': [500, 400, 600, 700]
})
# Join the DataFrames on 'Employee ID'
result = df_employees.set_index('Employee ID').join(df_salaries.set_index('Employee ID'))
print(result)
Output :
Name Department Salary Bonus
Employee ID
101 Fatimah Al-Zahrani Finance 5000 500
102 Ahmed Al-Zahrani Human Resources 6000 400
103 Mohammed Al-Zahrani IT 7000 600
104 Sheikha Al-Zahrani Marketing NaN NaN
Explanation:
join() function is used here to combine the df_employees and
df_salaries DataFrames using their common column
Employee ID> as the index. Both DataFrames are first set to have Employee ID as their
index using set_index().df_employees DataFrame are retained, while the corresponding salary and
bonus data from the df_salaries DataFrame is added based on matching Employee ID.
Employee ID 104, there is no corresponding entry in df_salaries, so the
resulting columns for Salary and Bonus are NaN for that row.concat(), merge(), and join()| Feature | concat() |
merge() |
join() |
|---|---|---|---|
| Purpose | Merges DataFrames along a specified axis (row-wise or column-wise). | Performs SQL-like joins based on common columns or indices. | Joins DataFrames using their index. |
| Primary Use Case | Stacking or appending DataFrames. | Combining datasets with key-based relationships. | Efficiently joining on index without manual key matching. |
| Default Behavior | Concatenates along rows (axis=0). |
Performs an inner join by default. |
Performs a left join by default. |
| Join Type Supported | Not applicable (direct concatenation). | inner, outer, left, right. |
inner, outer, left, right. |
| Index Handling | Keeps original index unless ignore_index=True is used. |
Can merge on index or column with on, left_on, right_on. |
Uses index by default but can specify a column with on=. |
| Duplicates Handling | Retains all rows unless manually removed. | Keeps duplicate rows unless handled separately. | Duplicates remain unless manually handled. |
| NaN Handling | Preserves NaN values unless fillna() is used. |
Introduces NaNs for non-matching keys in outer joins. |
Introduces NaNs for missing index matches unless how='inner'. |
| MultiIndex Support | Yes, supports MultiIndex concatenation. | Yes, supports MultiIndex merging. | Yes, primarily for index-based operations. |
| Performance | Faster for simple stacking operations. | Slower due to key-based lookups. | Faster than merge() for index-based joins. |
| Example Usage | pd.concat([df1, df2], axis=0) |
df1.merge(df2, on='key', how='inner') |
df1.join(df2, how='left') |
Question 1: Which function would you use to combine two DataFrames based on a common column?
Question 2: Which function is best for combining DataFrames along a particular axis, without considering index or column labels?
Question 3: Which function is more suited for combining DataFrames based on the index?
Question 4: Which function allows you to specify the type of join (inner, outer, etc.)?
Question 5: Which function would you use to combine DataFrames with different columns but the same index?